Just as the grammar of language helps us construct meaningful sentences out of words, the Grammar of Graphics helps us construct graphical figures out of different visual elements. This grammar gives us a way to talk about the parts of a plot: all the circles, lines, arrows, and words that are combined into a diagram for visualizing data. The components of a plot are the data, the geometric objects (geoms), the aesthetic mappings, the statistical transformations, the position adjustments, the scales, the coordinate system, and the facets.
These components are organized into layers, where each layer has a single geometric object, statistical transformation, and position adjustment. Following this grammar, you can think of each plot as a set of layers of images, where each image’s appearance is based on some aspect of the data set.
Today we’ll look at the mpg dataset and explore it visually using ggplot2.
This dataset provides “Fuel economy data from 1999 and 2008 for 38 popular models of cars”. The dataset ships with the ggplot2 package.
Import the dataset into R
# Load ggplot2 library
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
dim(mpg)
## [1] 234 11
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manu~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manu~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manu~ f 18 26 p comp~
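In order to create a plot, you call ggplot() to create a blank canvas, specify aesthetic mappings from variables to visual properties, and then add one or more geom layers. The first two steps look like this: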
# create canvas
ggplot(data = mpg) + ggtitle("Canvas")
# variables of interest mapped
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + ggtitle("Canvas + variables mapped to axes")
Every ggplot2 plot has three key components:
Data,
A set of aesthetic mappings between variables in the data and visual properties, and
At least one geom (geometric object), which describes how to render each observation.
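As a minimal sketch of these three components working together, the block below builds a scatterplot twice: once with the usual geom_point() shorthand and once with layer(), which names the geom, the statistical transformation, and the position adjustment explicitly. The object names p_short and p_explicit are illustrative and not part of the original tutorial.

```r
library(ggplot2)

# Shorthand: geom_point() supplies the identity stat and identity position.
p_short <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# The same layer spelled out with layer(), naming each part explicitly.
p_explicit <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  layer(geom = "point", stat = "identity", position = "identity")

# Both plots compute one row of layer data per observation in mpg.
nrow(layer_data(p_short, 1))
nrow(layer_data(p_explicit, 1))
```

In everyday use the shorthand is preferable; the layer() form is shown only to make the pieces of a layer visible.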
ggplot(data = mpg, aes(x = as.factor(year))) +
geom_bar(fill="grey") +
xlab("Year")
table(mpg$class)
##
## 2seater compact midsize minivan pickup subcompact
## 5 47 41 11 33 35
## suv
## 62
# Bar chart: no y mapping needed, geom_bar() counts the rows in each class
ggplot(data = mpg, aes(x = class)) +
geom_bar(fill="blue") +
xlab("Vehicle Class") +
ylab("Count")
#flip x and y axis with coord_flip
ggplot(mpg, aes(x = class)) +
geom_bar(fill="blue") +
coord_flip()
ggplot(data=mpg,aes(x=manufacturer)) + geom_bar(aes(fill = class)) +
coord_flip()
ggplot(data = mpg, aes(x = hwy)) +
geom_histogram(fill="red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mpg, aes(hwy)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mpg, aes(hwy)) +
geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Both histograms and frequency polygons work in the same way: they bin the data, then count the number of observations in each bin. The only difference is the display: histograms use bars and frequency polygons use lines.
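The bin-then-count idea can be sketched by hand. The block below is an illustration only: the break points (5 mpg wide, from 10 to 45) are an assumption chosen to cover every hwy value, and the names breaks, manual_counts, and p_hist are hypothetical.

```r
library(ggplot2)

# Hypothetical break points: 5-mpg-wide bins that cover every hwy value.
breaks <- seq(10, 45, by = 5)

# By hand: assign each car to a bin, then count the cars in each bin.
manual_counts <- table(cut(mpg$hwy, breaks = breaks))
manual_counts

# The same breaks handed to geom_histogram(); stat_bin() does the counting.
p_hist <- ggplot(mpg, aes(hwy)) +
  geom_histogram(breaks = breaks)

# Counts computed by ggplot2 for each bar.
layer_data(p_hist, 1)$count
```

Comparing the two sets of counts shows that the histogram is a display of this tabulation, and geom_freqpoly() would draw the same counts as a line.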
How do we decide on bin width?
ggplot(mpg, aes(hwy)) +
geom_histogram(binwidth = 1, fill="green")
library(gridExtra)
library(ggridges)
## Warning: package 'ggridges' was built under R version 3.4.4
##
## Attaching package: 'ggridges'
## The following object is masked from 'package:ggplot2':
##
## scale_discrete_manual
p1 <- ggplot(mpg, aes(hwy)) +
geom_histogram(binwidth = 1) +
ggtitle("Bin width = 1 mpg")
p2 <- ggplot(mpg, aes(hwy)) +
geom_histogram(binwidth = 10) +
ggtitle("Bin width = 10 mpg")
p3 <- ggplot(mpg, aes(hwy)) +
geom_histogram(binwidth = 20) +
ggtitle("Bin width = 20 mpg")
p4 <- ggplot(mpg, aes(hwy)) +
geom_histogram(binwidth = 30) +
ggtitle("Bin width = 30 mpg")
grid.arrange(p1, p2, p3, p4, ncol = 2)
ggplot(mpg, aes(displ, colour = drv)) +
geom_freqpoly(binwidth = 0.5)
ggplot(mpg, aes(displ, fill = drv)) +
geom_histogram(binwidth = 0.5)
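A call such as ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() produces a scatterplot defined by the data (mpg), the aesthetic mapping (engine size, displ, on the x axis and highway fuel economy, hwy, on the y axis), and a layer of points (geom_point()).
Note that when you added the geom layer you used the addition (+) operator. As you add new layers you will always use + to add onto your visualization.
Observation - The plot shows a strong negative correlation: as the engine size gets bigger, the fuel economy gets worse.
The following code is equivalent to the example above; it relies on positional argument matching instead of naming data, x, and y.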
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
ggtitle("Data plotted between hwy and displ")
The aesthetic mappings take properties of the data and use them to influence visual characteristics, such as position, color, size, shape, or transparency. Each visual characteristic can thus encode an aspect of the data and be used to convey information.
All aesthetics for a plot are specified in the aes() function call (later in this tutorial you will see that each geom layer can have its own aes specification).
For example, we can add a mapping from the class of the cars to a color characteristic:
table(mpg$class)
##
## 2seater compact midsize minivan pickup subcompact
## 5 47 41 11 33 35
## suv
## 62
unique(mpg$class)
## [1] "compact" "midsize" "suv" "2seater" "minivan"
## [6] "pickup" "subcompact"
ggplot(mpg, aes(x = displ, y = cty, color = class)) +
geom_point()
Observation - This colours each point according to its class. The legend allows us to read data values from the colour, showing us that the group of cars with unusually high fuel economy for their engine size are two seaters: cars with big engines, but lightweight bodies.
Note that using the aes() function will cause the visual channel to be based on the data specified in the argument. For example, using aes(color = “blue”) won’t cause the geometry’s color to be “blue”, but will instead cause the visual channel to be mapped from the vector c(“blue”) - as if we only had a single type of engine that happened to be called “blue”.
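The difference between mapping and setting a colour can be checked directly. The sketch below builds both variants and inspects the colours ggplot2 assigns; the names p_mapped and p_fixed are illustrative.

```r
library(ggplot2)

# "blue" inside aes() is treated as data: a one-level variable named "blue".
p_mapped <- ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(aes(color = "blue"))

# "blue" outside aes() is a fixed setting applied to every point.
p_fixed <- ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(color = "blue")

# Inspect the colours ggplot2 actually assigns to the points.
unique(layer_data(p_mapped, 1)$colour)  # a default palette colour, not "blue"
unique(layer_data(p_fixed, 1)$colour)   # "blue"
```

The mapped version also draws a legend with a single entry labelled "blue", which is usually the first visible sign of this mistake.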
Alternate Approaches
ggplot(mpg, aes(x = displ, y = cty, size = class)) +
geom_point()
## Warning: Using size for a discrete variable is not advised.
ggplot(mpg, aes(x = displ, y = cty, alpha = class)) +
geom_point()
## Warning: Using alpha for a discrete variable is not advised.
ggplot(mpg, aes(x = displ, y = cty)) +
geom_point(color = "blue")
table(mpg$drv)
##
## 4 f r
## 103 106 25
unique(mpg$drv)
## [1] "f" "4" "r"
ggplot(mpg, aes(x = displ, y = hwy, color=drv, shape = as.factor(cyl))) +
geom_point(size=2)
unique(mpg$class)
## [1] "compact" "midsize" "suv" "2seater" "minivan"
## [6] "pickup" "subcompact"
boxplot(cty~class,data=mpg, main="City Mileage Data",
xlab="Vehicle Class", ylab="City Mileage",cex.axis=0.75)
ggplot(mpg, aes(class, cty,fill=class)) +
geom_boxplot()
p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
geom_point()
Once you have a plot object, there are a few things you can do with it:
Render it on screen, with print(). This happens automatically when running interactively, but inside a loop or function, you’ll need to print() it yourself.
print(p)
# Save the plot object as a PNG file on disk (without plot = p, ggsave() saves the last plot displayed)
ggsave("plot.png", plot = p, width = 5, height = 5)
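The point about loops can be illustrated with a short sketch. The helper mileage_plot() and the list plots are hypothetical names introduced for this example, and the .data pronoun assumes a reasonably recent ggplot2 (3.0.0 or later).

```r
library(ggplot2)

# Hypothetical helper: scatterplot of engine size against a chosen mileage column.
mileage_plot <- function(y_var) {
  ggplot(mpg, aes(x = displ, y = .data[[y_var]])) +
    geom_point() +
    labs(y = y_var)
}

plots <- list()
for (y_var in c("cty", "hwy")) {
  plots[[y_var]] <- mileage_plot(y_var)
  # Inside a loop nothing is drawn unless print() is called explicitly.
  print(plots[[y_var]])
}
```

Keeping the plot objects in a list also makes it possible to save or rearrange them later without rebuilding them.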
Building on these basics, ggplot2 can be used to build almost any kind of plot you may want. These plots are declared using functions that follow from the Grammar of Graphics. The most obvious distinction between plots is what geometric objects (geoms) they include.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
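geom_smooth() draws a smooth curve fitted to the data, together with an assessment of uncertainty in the form of a point-wise confidence interval shown in grey. Added to a plot that already has geom_point(), it overlays the scatterplot with this curve.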
An important argument to geom_smooth() is the method, which allows you to choose which type of model is used to fit the smooth curve.
method = “loess”, the default for small n, uses a smooth local regression.
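As an illustration of choosing a different model, the sketch below swaps the default smoother for a straight-line fit with method = "lm". The object name p_lm is illustrative, and the formula is written out only to make the fitted model explicit.

```r
library(ggplot2)

# method = "lm" fits a straight line instead of a local regression.
# The formula is given explicitly so the fitted model is visible in the code.
p_lm <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

p_lm
```

A linear fit is easier to describe than a loess curve, but it may hide curvature in the data, so it is worth comparing the two.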
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# color aesthetic specified for only the geom_point layer
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
If you look at the bar chart below, you’ll notice that the y axis was defined for us as the count of elements of each type. This count isn’t part of the data set (it’s not a column in mpg), but is instead a statistical transformation that geom_bar() automatically applies to the data. In particular, it applies the stat_count transformation.
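One way to see the computed variable is to name it explicitly. The sketch below assumes ggplot2 3.3.0 or later, where after_stat() is available; the object name p_count is illustrative.

```r
library(ggplot2)

# geom_bar() with its default stat written out, and y mapped to the
# count variable that stat_count() computes.
p_count <- ggplot(mpg, aes(x = class)) +
  geom_bar(aes(y = after_stat(count)), stat = "count")

# The computed counts, one per class, next to a direct tabulation of the column.
layer_data(p_count, 1)$count
table(mpg$class)
```

The two outputs list the same numbers, which confirms that the bar heights come from counting rows rather than from any column in mpg.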
ggplot(mpg, aes(x = class)) +
geom_bar(fill='blue')
class_count <- dplyr::count(mpg, class)
class_count
## # A tibble: 7 x 2
## class n
## <chr> <int>
## 1 2seater 5
## 2 compact 47
## 3 midsize 41
## 4 minivan 11
## 5 pickup 33
## 6 subcompact 35
## 7 suv 62
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.4
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ggplot(class_count, aes(x = class, y = n)) +
geom_bar(stat = "identity")
# Computing the counts yourself is useful when, for example, you want to sort the bars.
# Here the bars are ordered by frequency.
class_drive <- mpg %>% group_by(class) %>% summarize(freq = n())
ggplot(class_drive, aes(reorder(class, freq), freq)) +
geom_bar(stat= "identity") +
ggtitle("Total count")
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.2.1 --
## ✔ tibble 1.4.2 ✔ purrr 0.2.5
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'stringr' was built under R version 3.4.4
## Warning: package 'forcats' was built under R version 3.4.3
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::combine() masks gridExtra::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x ggridges::scale_discrete_manual() masks ggplot2::scale_discrete_manual()
library(scales)
## Warning: package 'scales' was built under R version 3.4.4
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
class_drive <- mpg %>% group_by(class) %>% summarize(count = n())%>% mutate(percentage = count/sum(count)) # find percent of total
## Warning: package 'bindrcpp' was built under R version 3.4.4
class_drive
## # A tibble: 7 x 3
## class count percentage
## <chr> <int> <dbl>
## 1 2seater 5 0.0214
## 2 compact 47 0.201
## 3 midsize 41 0.175
## 4 minivan 11 0.0470
## 5 pickup 33 0.141
## 6 subcompact 35 0.150
## 7 suv 62 0.265
ggplot(class_drive, aes(class, percentage, fill = class)) +
geom_bar(stat = "identity") +
geom_text(aes(label=scales::percent(percentage)), position = position_stack(vjust = .5))+
scale_y_continuous(labels = scales::percent)
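ggplot(mpg, aes(displ, hwy)) +
geom_point(color = "green") +
# fun replaces the deprecated fun.y (ggplot2 >= 3.3.0); linewidth replaces size for lines (ggplot2 >= 3.4.0)
stat_summary(fun = "mean", geom = "line", linewidth = 1, linetype = "dashed")
ggplot(data = mpg) +
# fun.min, fun.max and fun replace the deprecated fun.ymin, fun.ymax and fun.y
stat_summary(mapping = aes(x = class, y = hwy), fun.min = min, fun.max = max, fun = median)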
In addition to a default statistical transformation, each geom also has a default position adjustment which specifies a set of “rules” as to how different components should be positioned relative to each other. This position is noticeable in a geom_bar if you map a different variable to the color visual characteristic:
# bar chart of class, colored by drive (front, rear, 4-wheel)
ggplot(mpg, aes(x = class, fill = drv)) +
geom_bar()
# bar chart of class, colored by drive (front, rear, 4-wheel)
# position = "dodge": values next to each other
ggplot(mpg, aes(x = class, fill = drv)) +
geom_bar(position = "dodge")
# position = "fill": percentage chart
ggplot(mpg, aes(x = class, fill = drv)) +
geom_bar(position = "fill")
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
# color the data by vehicle class
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point()
# same as above, with explicit scales
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point() +
scale_x_continuous("Engine Displacement") +
scale_y_continuous("Highway Mileage") +
scale_colour_discrete()
Each scale can be represented by a function with the following name: scale_, followed by the name of the aesthetic property, followed by an _ and the name of the scale. A continuous scale will handle things like numeric data (where there is a continuous set of numbers), whereas a discrete scale will handle things like colors (since there is a small list of distinct colors).
A common parameter to change is which set of colors to use in a plot. While you can use the default coloring, a more common option is to use the pre-defined palettes from colorbrewer.org. These color sets have been carefully designed to look good and to be viewable to people with certain forms of color blindness. We can use ColorBrewer palettes by adding the scale_color_brewer() function and passing the palette name as an argument.
# default color brewer
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point() +
scale_color_brewer()
# specifying color palette
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point() +
scale_color_brewer(palette = "Set3") # try palette = "RdBu" and compare the result
Note that you can get the palette name from the colorbrewer website by looking at the scheme query parameter in the URL. Alternatively, RColorBrewer::display.brewer.all() draws every palette alongside its name.
You can also specify continuous color values by using a gradient scale, or manually specify the colors you want to use as a named vector.
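Both options can be sketched briefly. The colour choices and the names drv_colors, p_manual, and p_gradient below are illustrative assumptions, not part of the original tutorial.

```r
library(ggplot2)

# Manual scale: a named vector maps each drv value to a colour of our choosing.
drv_colors <- c("4" = "darkorange", "f" = "steelblue", "r" = "forestgreen")
p_manual <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_manual(values = drv_colors)

# Gradient scale: a continuous variable (cty) is mapped onto a colour ramp.
p_gradient <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  scale_color_gradient(low = "yellow", high = "red")

p_manual
p_gradient
```

Naming the vector elements ties each colour to a data value, so the assignment does not depend on the order in which the values appear.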
The next term from the Grammar of Graphics that can be specified is the Coordinate system. As with scales, coordinate systems are specified with functions that all start with coord_ and are added as a layer. There are a number of different possible coordinate systems to use, including:
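coord_cartesian(), the default Cartesian coordinate system, where you specify x and y values.
coord_flip(), a Cartesian system with the x and y axes flipped.
coord_polar(), a system using polar coordinates.
coord_quickmap(), a coordinate system that approximates a good aspect ratio for maps. See the documentation for more details.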
# zoom in with coord_cartesian
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
coord_cartesian(xlim = c(0, 5))
bar <- ggplot(data = mpg) +
geom_bar(mapping = aes(x = class, fill = class), show.legend = FALSE,width = 1) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
# bar + coord_flip()
bar + coord_polar()
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(colour="blue") +
facet_wrap(~ class, nrow=2)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(color="red") +
facet_grid(year ~ cyl)
Textual labels and annotations (on the plot, axes, geometry, and legend) are an important part of making a plot understandable and communicating information. ggplot2 makes it easy to add such annotations.
You can add titles and axis labels to a chart using the labs() function (not labels, which is a different R function!):
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point() +
labs(title = "Fuel Efficiency by Engine Power",
subtitle = "Fuel economy data from 1999 and 2008 for 38 popular models of cars",
x = "Engine power (litres displacement)",
y = "Fuel Efficiency (miles per gallon)",
color = "Car Type")
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point() +
labs(title = "Fuel Efficiency by Engine Power",
subtitle = "Fuel economy data from 1999 and 2008 for 38 popular models of cars",
x = "Engine power (litres displacement)",
y = "Fuel Efficiency (miles per gallon)",
color = "Car Type") +
theme_classic()